In [ ]:
!pip install pgeocode
Defaulting to user installation because normal site-packages is not writeable
Collecting pgeocode
  Downloading pgeocode-0.4.1-py3-none-any.whl.metadata (8.4 kB)
Downloading pgeocode-0.4.1-py3-none-any.whl (9.8 kB)
Installing collected packages: pgeocode
Successfully installed pgeocode-0.4.1

Importing libraries

  • pandas: to read the data into dataframes
  • pgeocode: to look up the latitude and longitude of each ZIP code
  • NumPy: for numerical operations
  • Matplotlib: for plotting
  • ticker: to control how many ticks appear on a plot
  • seaborn: for statistical plots
  • Folium: for plotting on an interactive map
In [ ]:
import pandas as pd
import pgeocode
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap
import numpy as np
import matplotlib.ticker as ticker
import seaborn as sns
In [ ]:
# Reading sheet regency data to a dataframe
regencyData = pd.read_excel('data.xlsx', sheet_name='Regency_Data', dtype={'Zip Code': str})
In [ ]:
regencyData.head().T
Out[ ]:
0 1 2 3 4
Business Unit NaN 15.0 15.0 15.0 15.0
Division NaN Northeast Northeast Northeast Northeast
Region NaN PA-DE PA-DE PA-DE PA-DE
State NaN Pennsylvania Pennsylvania Pennsylvania Pennsylvania
City NaN Wayne Wayne Wayne Wayne
... ... ... ... ... ...
2023-04-01 00:00:00 NaN 0.0 0.0 0.0 0.0
2023-05-01 00:00:00 NaN 0.0 0.0 0.0 0.0
2023-06-01 00:00:00 NaN 0.0 0.0 0.0 0.0
2023-07-01 00:00:00 NaN 0.0 0.0 0.0 0.0
2023-08-01 00:00:00 NaN 0.0 0.0 0.0 0.0

298 rows × 5 columns

In [ ]:
# Strip the "-01 00:00:00" timestamp suffix from the monthly column names
for column in regencyData.columns:
    name = str(column).replace('-01 00:00:00', '')
    regencyData.rename(columns={column: name}, inplace=True)
    print(name)
Business Unit
Division
Region
State
City
Zip Code
Gross Leasable Area (SF)
Year Constructed
Year of last Completed Redev
Year Acquired
Control
Category
Expense
Description
2000-01
2000-02
...
2023-08
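The same suffix removal can also be done in one vectorized step instead of renaming column by column. A minimal sketch, using a couple of stand-in timestamp labels in place of the real Excel headers:

```python
import pandas as pd

# Stand-in frame whose column labels mimic the raw Excel headers
df = pd.DataFrame(columns=['Business Unit',
                           pd.Timestamp('2000-01-01'),
                           pd.Timestamp('2023-08-01')])

# Cast every label to str and strip the timestamp suffix in one pass
df.columns = df.columns.astype(str).str.replace('-01 00:00:00', '', regex=False)

print(list(df.columns))  # ['Business Unit', '2000-01', '2023-08']
```

This avoids calling `rename` once per column, which rebuilds the column index on every iteration.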

Used describe in order to get statistical summaries of the numerical and categorical variables:

  • Gross Leasable Area (SF) is missing a few values, so those rows will be removed from the data.
  • The oldest building is from 1906 and the newest from 2018. I will explore how much building age affects the expenses.
  • The company has most of its properties in Florida; however, the single city with the most properties is Atlanta.
In [ ]:
regencyData.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Business Unit 29488.0 258452.237317 1.165765e+06 15.00 629.0000 60706.0 80101.0 8010204.00
Gross Leasable Area (SF) 29386.0 128452.048288 1.011916e+05 0.00 78820.0000 106482.0 152294.0 1072259.00
Year Constructed 29488.0 1988.729110 1.719589e+01 1906.00 1979.0000 1990.0 2000.0 2018.00
Year Acquired 29488.0 2007.181158 7.701766e+00 1993.00 1999.0000 2005.0 2016.0 2021.00
2000-01 29488.0 -79.443192 1.022161e+03 -51688.00 0.0000 0.0 0.0 11523.00
... ... ... ... ... ... ... ... ...
2023-04 29488.0 -1157.239697 7.341190e+03 -486572.78 -297.9375 0.0 0.0 32292.47
2023-05 29488.0 -1165.502560 7.719511e+03 -540258.47 -264.1175 0.0 0.0 66000.00
2023-06 29488.0 -1185.444853 7.767677e+03 -520149.82 -208.6850 0.0 0.0 300000.00
2023-07 29488.0 -1177.278201 7.493314e+03 -526337.03 -240.3650 0.0 0.0 31393.03
2023-08 29488.0 -1146.933299 7.525011e+03 -526337.03 -227.7650 0.0 0.0 96064.35

288 rows × 8 columns

In [ ]:
regencyData.describe(include=['O']).T
Out[ ]:
count unique top freq
Division 29488 4 Southeast 9737
Region 29488 20 VA-MD-DC 2786
State 29488 25 Florida 6235
City 29488 248 Atlanta 911
Zip Code 29488 340 6824 397
Year of last Completed Redev 14494 35 0 3633
Control 29330 3 Controllable 15957
Category 29488 4 Operating Expenses 21753
Expense 29488 51 Life Safety 1899
Description 29488 375 INSURANCE PROPERTY 423
In [ ]:
# Counting how many units they have by checking the number of unique values in Business Unit
regencyData['Business Unit'].nunique()
Out[ ]:
435
  • There are a lot of missing values for Year of last Completed Redev. However, that likely means the property never needed redevelopment, so the missing values will be replaced with 'Never'.
  • Control is missing for some rows, so those rows will be removed.
  • Gross Leasable Area (SF) is missing some values, so those rows will be removed as well.
  • All attributes are missing one value, most likely due to the filter row in the Excel file.
In [ ]:
# Counting NaN values in every column
nan_count = regencyData.isna().sum().sort_values(ascending=False)
for index, row in nan_count.items():
    print(f'{index}   {row}')
Year of last Completed Redev   14995
Control   159
Gross Leasable Area (SF)   103
Business Unit   1
...
(every remaining column, including each monthly column, is missing exactly 1 value)
In [ ]:
# Removing first row in dataframe
regencyData = regencyData.iloc[1:]
In [ ]:
# Filling missing redevelopment values with 'Never'
regencyData['Year of last Completed Redev'] = regencyData['Year of last Completed Redev'].fillna('Never')
In [ ]:
# Dropping null Control and Gross Leasable area rows
regencyData = regencyData.dropna(subset=['Control','Gross Leasable Area (SF)'])
In [ ]:
# Re-counting NaN values in every column to confirm none are left
nan_count = regencyData.isna().sum().sort_values(ascending=False)
for index, row in nan_count.items():
    print(f'{index}   {row}')
Business Unit   0
Division   0
Region   0
...
(every column now reports 0 missing values)

We will create a new dataframe called unit_df that only includes information related to the units. This step is done in order to focus on unit-level variables without the duplicate rows introduced by the expense records.

In [ ]:
# Getting the unique IDs in the Business Unit column
unitId = regencyData['Business Unit'].unique()
In [ ]:
# Creating a new dataframe with unit-level information, without the expense rows
rows = []
for unit in unitId:
    filtered_row = regencyData[regencyData['Business Unit'] == unit]
    if not filtered_row.empty:
        # Keeping only the first row for each unique id
        first_row = filtered_row.iloc[0]
        rows.append({'Business Unit': first_row['Business Unit'], 'Division': first_row['Division'],
                     'Region': first_row['Region'], 'State': first_row['State'],
                     'Zip Code': first_row['Zip Code'], 'Lat': 0, 'Long': 0,
                     'Year Constructed': first_row['Year Constructed'],
                     'Year Acquired': first_row['Year Acquired'], 'Expenses': -1})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows instead
unit_df = pd.DataFrame(rows, columns=['Business Unit', 'Division', 'Region', 'State', 'Zip Code',
                                      'Lat', 'Long', 'Year Constructed', 'Year Acquired', 'Expenses'])
In [ ]:
# The Expenses column will be reused for various analyses; it doesn't have a single specific task
unit_df.head()
Out[ ]:
Business Unit Division Region State Zip Code Lat Long Year Constructed Year Acquired Expenses
0 15.0 Northeast PA-DE Pennsylvania 19087 0 0 1960.0 2004.0 -1
1 18.0 Southeast Tampa Florida 33511 0 0 1986.0 1993.0 -1
2 20.0 Southeast Jacksonville Florida 32606 0 0 1986.0 1994.0 -1
3 43.0 Southeast Jacksonville Florida 32605 0 0 1974.0 1993.0 -1
4 51.0 Southeast Tampa Florida 34104 0 0 1992.0 1994.0 -1
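The per-unit loop above can also be collapsed into a single `drop_duplicates` call on Business Unit. A minimal sketch with toy stand-in data (the column names follow the notebook; the values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for regencyData: two expense rows for unit 15.0, one for unit 18.0
regency = pd.DataFrame({
    'Business Unit': [15.0, 15.0, 18.0],
    'State': ['Pennsylvania', 'Pennsylvania', 'Florida'],
    'Zip Code': ['19087', '19087', '33511'],
})

# Keep the first row per unit, then add the placeholder columns used later
unit_df = regency.drop_duplicates(subset='Business Unit').copy()
unit_df['Lat'] = 0
unit_df['Long'] = 0
unit_df['Expenses'] = -1

print(unit_df)  # one row per Business Unit
```

This keeps exactly the first occurrence of each unit, matching what the loop's `iloc[0]` does, without a Python-level loop over unit IDs.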

We want to explore the effect of being in a coastal city on various expenses, but first we need to verify that some units are actually near coastal areas. For that we will plot them on a heatmap using the folium library.

In [ ]:
# Getting the latitude and longitude of a ZIP code
def get_lat_long(zip_code, country='us'):
    nomi = pgeocode.Nominatim(country)
    location = nomi.query_postal_code(zip_code)

    # query_postal_code returns a Series; ZIP codes it cannot resolve come back with NaN coordinates
    if location is not None and not location.empty:
        return location.latitude, location.longitude
    else:
        return None, None


# Example usage
latitude, longitude = get_lat_long("10001")  # Replace with any ZIP code
print("Latitude:", latitude, "Longitude:", longitude)
Latitude: 40.7484 Longitude: -73.9967
In [ ]:
# Inserting the lat and long of each ZIP code into the dataframe
# Some lookups return NaN because ZIP codes with a leading zero were stored with 4 or fewer digits
for index, row in unit_df.iterrows():
    latitude, longitude = get_lat_long(row['Zip Code'])
    unit_df.at[index, 'Lat'] = latitude
    unit_df.at[index, 'Long'] = longitude
    print(row['Zip Code'],"     Latitude:", latitude, "Longitude:", longitude)
19087      Latitude: 40.0612 Longitude: -75.3999
33511      Latitude: 27.9056 Longitude: -82.2881
32606      Latitude: 29.6954 Longitude: -82.4023
32605      Latitude: 29.6785 Longitude: -82.3679
34104      Latitude: 26.1529 Longitude: -81.7417
...
1906      Latitude: nan Longitude: nan
...
2169      Latitude: nan Longitude: nan
2132      Latitude: nan Longitude: nan
33558      Latitude: 28.1474 Longitude: -82.5152
92673      Latitude: 33.4615 Longitude: -117.6375
33321      Latitude: 26.212 Longitude: -80.2696
33179      Latitude: 25.9571 Longitude: -80.1814
2138      Latitude: nan Longitude: nan
95616      Latitude: 38.5538 Longitude: -121.7418
34986      Latitude: 27.3215 Longitude: -80.403
30213      Latitude: 33.5648 Longitude: -84.5809
34743      Latitude: 28.3306 Longitude: -81.3544
32960      Latitude: 27.633 Longitude: -80.4031
32792      Latitude: 28.5974 Longitude: -81.3036
90815      Latitude: 33.7939 Longitude: -118.1192
6854      Latitude: nan Longitude: nan
33033      Latitude: 25.4906 Longitude: -80.438
33165      Latitude: 25.7343 Longitude: -80.3588
33193      Latitude: 25.6964 Longitude: -80.4401
11590      Latitude: 40.7557 Longitude: -73.5723
33317      Latitude: 26.1122 Longitude: -80.2264
33317      Latitude: 26.1122 Longitude: -80.2264
20816      Latitude: 38.9585 Longitude: -77.1153
20816      Latitude: 38.9585 Longitude: -77.1153
20816      Latitude: 38.9585 Longitude: -77.1153
30338      Latitude: 33.9669 Longitude: -84.3249
94520      Latitude: 37.9823 Longitude: -122.0362
94015      Latitude: 37.6787 Longitude: -122.478
94103      Latitude: 37.7725 Longitude: -122.4147
91344      Latitude: 34.2771 Longitude: -118.4992
91301      Latitude: 34.1227 Longitude: -118.7573
92821      Latitude: 33.9291 Longitude: -117.8845
92677      Latitude: 33.5145 Longitude: -117.7084
92119      Latitude: 32.8036 Longitude: -117.0261
92110      Latitude: 32.7635 Longitude: -117.2028
91941      Latitude: 32.7604 Longitude: -117.0115
94066      Latitude: 37.6247 Longitude: -122.429
95051      Latitude: 37.3483 Longitude: -121.9844
94523      Latitude: 37.954 Longitude: -122.0737
94558      Latitude: 38.4549 Longitude: -122.2564
95136      Latitude: 37.2685 Longitude: -121.849
94598      Latitude: 37.9194 Longitude: -122.0259
93105      Latitude: 34.4369 Longitude: -119.7285
80302      Latitude: 40.0172 Longitude: -105.2851
80033      Latitude: 39.774 Longitude: -105.0962
80122      Latitude: 39.5814 Longitude: -104.9557
80004      Latitude: 39.8141 Longitude: -105.1177
6110      Latitude: nan Longitude: nan
20016      Latitude: 38.9381 Longitude: -77.086
19810      Latitude: 39.8188 Longitude: -75.5064
60714      Latitude: 42.0312 Longitude: -87.8112
60608      Latitude: 41.8515 Longitude: -87.6694
60618      Latitude: 41.9464 Longitude: -87.7042
46268      Latitude: 39.8682 Longitude: -86.2123
46268      Latitude: 39.8682 Longitude: -86.2123
21208      Latitude: 39.3764 Longitude: -76.729
21234      Latitude: 39.3876 Longitude: -76.5418
21230      Latitude: 39.2645 Longitude: -76.6224
20874      Latitude: 39.1355 Longitude: -77.2822
20878      Latitude: 39.1125 Longitude: -77.2515
20912      Latitude: 38.9832 Longitude: -77.0007
20774      Latitude: 38.8682 Longitude: -76.8156
20901      Latitude: 39.0191 Longitude: -77.0076
55391      Latitude: 44.9847 Longitude: -93.5422
55442      Latitude: 45.0467 Longitude: -93.431
27511      Latitude: 35.7641 Longitude: -78.7786
7470      Latitude: nan Longitude: nan
8108      Latitude: nan Longitude: nan
97223      Latitude: 45.4403 Longitude: -122.7793
18102      Latitude: 40.6068 Longitude: -75.4781
18017      Latitude: 40.6622 Longitude: -75.3903
19151      Latitude: 39.9772 Longitude: -75.2545
18901      Latitude: 40.3054 Longitude: -75.1489
19073      Latitude: 39.9863 Longitude: -75.407
18929      Latitude: 40.2566 Longitude: -75.0961
77005      Latitude: 29.7179 Longitude: -95.4263
77057      Latitude: 29.7422 Longitude: -95.4903
23111      Latitude: 37.6281 Longitude: -77.3395
23226      Latitude: 37.5825 Longitude: -77.5197
20147      Latitude: 39.0373 Longitude: -77.4805
20121      Latitude: 38.8195 Longitude: -77.4558
22310      Latitude: 38.7794 Longitude: -77.1194
20171      Latitude: 38.9252 Longitude: -77.3928
22033      Latitude: 38.8776 Longitude: -77.3885
22030      Latitude: 38.8458 Longitude: -77.3242
22151      Latitude: 38.8029 Longitude: -77.2116
22153      Latitude: 38.7449 Longitude: -77.237
22044      Latitude: 38.8589 Longitude: -77.1548
22044      Latitude: 38.8589 Longitude: -77.1548
98026      Latitude: 47.8353 Longitude: -122.327
98006      Latitude: 47.5614 Longitude: -122.1552
98052      Latitude: 47.6718 Longitude: -122.1232
77005      Latitude: 29.7179 Longitude: -95.4263
29412      Latitude: 32.718 Longitude: -79.9537
95667      Latitude: 38.7195 Longitude: -120.8046
32092      Latitude: 29.9475 Longitude: -81.5264
34119      Latitude: 26.2665 Longitude: -81.7146
2155      Latitude: nan Longitude: nan
98102      Latitude: 47.6302 Longitude: -122.321
8807      Latitude: nan Longitude: nan
8807      Latitude: nan Longitude: nan
22201      Latitude: 38.8871 Longitude: -77.0932
22201      Latitude: 38.8871 Longitude: -77.0932
22201      Latitude: 38.8871 Longitude: -77.0932
22201      Latitude: 38.8871 Longitude: -77.0932
22201      Latitude: 38.8871 Longitude: -77.0932
32819      Latitude: 28.4522 Longitude: -81.4678
32819      Latitude: 28.4522 Longitude: -81.4678
30114      Latitude: 34.2505 Longitude: -84.4909
6880      Latitude: nan Longitude: nan
6880      Latitude: nan Longitude: nan
6880      Latitude: nan Longitude: nan
6880      Latitude: nan Longitude: nan
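The NaN coordinates in the output correspond to zip codes whose leading zeros were stripped when they were read as numbers (6880 should be 06880). A minimal sketch of restoring the padding (the helper name `pad_zip` is ours):

```python
def pad_zip(zip_code):
    # US zip codes are 5 digits; zfill restores zeros dropped by a numeric read
    return str(zip_code).zfill(5)
```

Querying `pgeocode.Nominatim('us').query_postal_code(pad_zip(6880))` should then return real coordinates instead of NaN.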
In [ ]:
unit_df.head()
Out[ ]:
Business Unit Division Region State Zip Code Lat Long Year Constructed Year Acquired Expenses
0 15.0 Northeast PA-DE Pennsylvania 19087 40.0612 -75.3999 1960.0 2004.0 -1
1 18.0 Southeast Tampa Florida 33511 27.9056 -82.2881 1986.0 1993.0 -1
2 20.0 Southeast Jacksonville Florida 32606 29.6954 -82.4023 1986.0 1994.0 -1
3 43.0 Southeast Jacksonville Florida 32605 29.6785 -82.3679 1974.0 1993.0 -1
4 51.0 Southeast Tampa Florida 34104 26.1529 -81.7417 1992.0 1994.0 -1
In [ ]:
unit_df_no_null = unit_df.dropna(subset=['Lat','Long'])
In [ ]:
# Create a map instance
m = folium.Map(location=[unit_df_no_null['Lat'].mean(), unit_df_no_null['Long'].mean()], zoom_start=6)

# Create a list of [lat, lon] pairs
heat_data = unit_df_no_null[['Lat', 'Long']].values.tolist()

# Create a HeatMap layer and add it to the map
HeatMap(heat_data).add_to(m)

# Save it to an HTML file
m.save('heatmap.html')

The map shows that Regency locations are concentrated near bodies of water, although there is a sufficient sample of malls far from beaches and other water surfaces. Looking at the map, there is no need to write code to determine whether a location is near water: within each state the malls are either all near water or all away from it. The only possible exception is Houston, Texas, which sits close to a water surface.

In [ ]:
m
Out[ ]:
[interactive folium map output]

To determine whether a mall is near water, we can generalize the analysis to the state in which the mall is located. This is also convenient because taxes are based on state tax rates. The near-water value will be assigned manually for each state.

In [ ]:
unit_df['State'].unique()
Out[ ]:
array(['Pennsylvania', 'Florida', 'Georgia', 'North Carolina', 'Ohio',
       'Delaware', 'Illinois', 'Texas', 'California', 'Colorado',
       'Michigan', 'Tennessee', 'Washington', 'Virginia', 'Oregon',
       'Maryland', 'Indiana', 'Massachusetts', 'Missouri', 'Minnesota',
       'New York', 'Connecticut', 'South Carolina', 'New Jersey',
       'District of Columbia'], dtype=object)
In [ ]:
# Initialize the flag column (assigning np.zeros without calling it would store the function object itself)
unit_df['Near Water'] = 0
In [ ]:
# Decision for each state is based on the heatmap
# 'Pennsylvania' : 0
# 'Florida' : 1
# 'Georgia' : 0
# 'North Carolina' : 0
# 'Ohio' : 0
# 'Delaware' : 1
# 'Illinois' : 1
# 'Texas' : uncertain; 0 for now
# 'California' : 1
# 'Colorado' : 0
# 'Michigan' : 0
# 'Virginia' : 0
# 'Oregon' : 1 (appears to be near water)
# 'Maryland' : 1
# 'Indiana' : 0
# 'Massachusetts' : -1 (possibly among the NaN rows)
# 'Missouri' : 0
# 'Minnesota' : 0
# 'New York' : 1
# 'Connecticut' : 1 (not visible on the map, but assumed coastal, following the NYC trend)
# 'South Carolina' : 1
# 'New Jersey': 0
# 'District of Columbia': 0
In [ ]:
statesWithWater = ['Florida','Delaware', 'Illinois', 'California','Oregon', 'Maryland', 'New York', 'Connecticut', 'South Carolina']
# Flag each row according to whether its state was judged near water
unit_df['Near Water'] = unit_df['State'].isin(statesWithWater).astype(int)
In [ ]:
unit_df.head()
Out[ ]:
Business Unit Division Region State Zip Code Lat Long Year Constructed Year Acquired Expenses Near Water
0 15.0 Northeast PA-DE Pennsylvania 19087 40.0612 -75.3999 1960.0 2004.0 -1 0
1 18.0 Southeast Tampa Florida 33511 27.9056 -82.2881 1986.0 1993.0 -1 1
2 20.0 Southeast Jacksonville Florida 32606 29.6954 -82.4023 1986.0 1994.0 -1 1
3 43.0 Southeast Jacksonville Florida 32605 29.6785 -82.3679 1974.0 1993.0 -1 1
4 51.0 Southeast Tampa Florida 34104 26.1529 -81.7417 1992.0 1994.0 -1 1

The analysis will use category-level expenses instead of the raw Expense column. The Expense column is very detailed and varies greatly from month to month; it might support accurate yearly predictions but is too volatile for monthly ones.

In [ ]:
# Getting the column names that are dates (date columns start at index 14)
dates = list(regencyData.columns[14:])
In [ ]:
# Summing expenses per month for each Business Unit and Category
unitExpensesPerMonth = regencyData.groupby(['Business Unit', 'Category'])[dates].sum().reset_index()
unitExpensesPerMonth.head().T
Out[ ]:
0 1 2 3 4
Business Unit 15.0 15.0 15.0 15.0 18.0
Category Insurance Expense Non-CAM Expenses Operating Expenses Real Estate Tax Expense Insurance Expense
2000-01 0.0 0.0 0.0 0.0 -4175.0
2000-02 0.0 0.0 0.0 0.0 -4175.0
2000-03 0.0 0.0 0.0 0.0 -4175.0
... ... ... ... ... ...
2023-04 -12948.47 -720.05 -73458.0 -38166.58 -22037.25
2023-05 -12948.47 -482.55 -25715.77 -38166.58 -22037.25
2023-06 -18159.12 -1933.35 -77736.53 -38166.59 -31604.16
2023-07 -14685.33 -479.55 -96710.95 -45473.42 -25226.22
2023-08 -14685.33 -6986.82 -72839.18 -39210.42 -25226.22

286 rows × 5 columns

In [ ]:
unique_business_units = unitExpensesPerMonth['Business Unit'].unique().tolist()

Plotting each expense category for all units over the years to explore whether there are any trends.

In [ ]:
def get_list_category(unitId, category):
    # Select the single row for this unit/category, then drop the first
    # two columns (Business Unit, Category) to keep only the monthly values
    row = unitExpensesPerMonth.loc[(unitExpensesPerMonth['Business Unit'] == unitId) &
                                   (unitExpensesPerMonth['Category'] == category)]
    return row.values.tolist()[0][2:]

This plot shows trends across the units. What is surprising is that some expenses are positive, which is most likely a data error.

In [ ]:
# Create a figure
plt.figure(figsize=(10, 6))

# Loop over units and add each as a line chart
for unitId in unitExpensesPerMonth['Business Unit'].unique().tolist():
    try:
        y = get_list_category(unitId, 'Insurance Expense')
        plt.plot(dates, y, label=unitId)
    except IndexError:
        # Unit has no row for this category
        print(unitId)

# Add labels and title
plt.xlabel('Month')
plt.ylabel('Insurance Expense ($)')
plt.title('Insurance Expense per Unit')
plt.xticks(rotation=90)
# Show every sixth tick on the x-axis
plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))

# Add legend
plt.legend()

# Show the plot
plt.show()
60538.0
60780.0
60837.0
60838.0
60860.0
60869.0
60876.0
60877.0
60880.0
80097.0
92053.0
[plot omitted: insurance expense lines for all units]

The first plot is too cluttered, so each unit will instead be plotted by state. This grouping is generally better since tax laws differ from state to state.

In [ ]:
def plot_by_month(df, state, category, water):
    # Create a figure
    plt.figure(figsize=(10, 6))

    # Loop over units in this state and add each as a line chart
    for unitId in df['Business Unit'].unique().tolist():
        try:
            if unitId not in [170.0]:  # excluded outlier unit
                y = get_list_category(unitId, category)

                # Skip units with any monthly expense below -50,000 (outliers)
                if all(value >= -50000 for value in y):
                    plt.plot(dates, y)
        except IndexError:
            # Unit has no row for this category
            print(unitId)

    # Add labels and title
    plt.xlabel('Month')
    plt.ylabel(f'{category} ($)')
    plt.title(f'{state}     {category}')
    plt.xticks(rotation=90)
    # Show every sixth tick on the x-axis
    plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))

    # Light blue background marks near-water states
    if water:
        plt.gca().set_facecolor('lightblue')

    # Show the plot
    plt.show()

Based on the following plots, the following was noticed:

  • All the states show broadly similar trends
  • States marked as near water are more likely to exceed $10,000 in monthly insurance expense
  • Insurance was stable until 2020, after which prices started increasing dramatically, most likely due to COVID
  • State is a very good indicator for expenses: in most graphs the spending trends and costs within a state are correlated and clustered together
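The near-water observation can be spot-checked by comparing each group's mean monthly insurance expense. A minimal sketch on toy numbers (the unit IDs and amounts below are illustrative, not from the data):

```python
# Toy mapping of unit -> (near_water flag, mean monthly insurance expense).
# Expenses are negative in this dataset (outflows), so higher cost = more negative.
units = {
    15.0: (0, -8500.0),
    18.0: (1, -14200.0),
    20.0: (1, -12800.0),
    43.0: (0, -9100.0),
}

def mean_expense(near_water):
    # Average the mean expense over units with the given flag
    vals = [exp for flag, exp in units.values() if flag == near_water]
    return sum(vals) / len(vals)
```

On the real data, the same comparison would use the `Near Water` column merged onto the per-unit averages.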
In [ ]:
# plots with light blue background are states where shops are near water
# Plotting insurance expenses
for state in unit_df['State'].unique().tolist():
    df = unit_df[unit_df['State'] == state]
    nearWater = df['Near Water'].iloc[0]
    plot_by_month(df, state, 'Insurance Expense',nearWater)
[per-state Insurance Expense plots omitted]
80097.0
60780.0
60837.0
60838.0
60860.0
92053.0
60538.0
60869.0
60876.0
60877.0
60880.0

There is no apparent trend in Non-CAM Expenses. In general the series is very volatile, changing sharply from month to month.
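One way to quantify this month-to-month volatility is the coefficient of variation (standard deviation relative to the absolute mean); a small self-contained sketch:

```python
import statistics

def coefficient_of_variation(values):
    # Std-dev relative to the absolute mean; higher means more volatile
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")
    return statistics.stdev(values) / abs(mean)
```

A steady series like Real Estate Tax would score near 0, while a jumpy Non-CAM series would score well above 1.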

In [ ]:
# plots with light blue background are states where shops are near water
# Plotting Non-CAM Expenses
for state in unit_df['State'].unique().tolist():
    df = unit_df[unit_df['State'] == state]
    nearWater = df['Near Water'].iloc[0]
    plot_by_month(df, state, 'Non-CAM Expenses',nearWater)
[per-state Non-CAM Expenses plots omitted]

Operating expenses vary from month to month but appear to be increasing slowly at a steady rate. Using a linear regression to predict them seems feasible.
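The steady drift described above can be captured with a least-squares line fitted against the month index; a self-contained sketch (equivalent to `np.polyfit(range(len(y)), y, 1)`):

```python
def linear_trend(values):
    # Least-squares slope and intercept of values against month index 0..n-1
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

The fitted slope gives the average monthly change, which can then be extrapolated for monthly predictions.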

In [ ]:
# plots with light blue background are states where shops are near water
# Plotting Operating Expenses
for state in unit_df['State'].unique().tolist():
    df = unit_df[unit_df['State'] == state]
    nearWater = df['Near Water'].iloc[0]
    plot_by_month(df, state, 'Operating Expenses',nearWater)
[per-state Operating Expenses plots omitted]

Real estate tax expense increases slowly over the years. A linear model would be sufficient for prediction.

In [ ]:
# plots with light blue background are states where shops are near water
# Plotting Real Estate Tax Expense
for state in unit_df['State'].unique().tolist():
    df = unit_df[unit_df['State'] == state]
    nearWater = df['Near Water'].iloc[0]
    plot_by_month(df, state, 'Real Estate Tax Expense',nearWater)
[per-state Real Estate Tax Expense plots omitted]
60780.0
60837.0
60838.0
60860.0
60538.0
60869.0
60876.0
60709.0
60877.0
60880.0

Creating correlation heatmaps to examine the effect of construction year on expenses.

In [ ]:
averageExpenses = unitExpensesPerMonth.copy()
In [ ]:
averageExpenses.head().T
Out[ ]:
0 1 2 3 4
Business Unit 15.0 15.0 15.0 15.0 18.0
Category Insurance Expense Non-CAM Expenses Operating Expenses Real Estate Tax Expense Insurance Expense
2000-01 0.0 0.0 0.0 0.0 -4175.0
2000-02 0.0 0.0 0.0 0.0 -4175.0
2000-03 0.0 0.0 0.0 0.0 -4175.0
... ... ... ... ... ...
2023-04 -12948.47 -720.05 -73458.0 -38166.58 -22037.25
2023-05 -12948.47 -482.55 -25715.77 -38166.58 -22037.25
2023-06 -18159.12 -1933.35 -77736.53 -38166.59 -31604.16
2023-07 -14685.33 -479.55 -96710.95 -45473.42 -25226.22
2023-08 -14685.33 -6986.82 -72839.18 -39210.42 -25226.22

286 rows × 5 columns

Getting the average expense for each unit and expense category over all the months.

In [ ]:
# Row-wise mean of the monthly date columns
averageExpenses['Average Expenses'] = averageExpenses[dates].mean(axis=1)
In [ ]:
# Map each unit's 'Year Constructed' from unit_df
# (first match per Business Unit; NaN when there is no match)
yearByUnit = unit_df.drop_duplicates('Business Unit').set_index('Business Unit')['Year Constructed']
averageExpenses['Year Constructed'] = averageExpenses['Business Unit'].map(yearByUnit)
In [ ]:
averageExpenses1 = averageExpenses[['Year Constructed', 'Category','Average Expenses']]

Surprisingly, the year of construction has a low correlation with the average value of each expense category.

In [ ]:
df = averageExpenses1[averageExpenses1['Category'] == 'Insurance Expense']
corr = df.corr(numeric_only=True)  # skip the non-numeric 'Category' column

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for Ordinal Values')
plt.show()
[correlation heatmap omitted]
In [ ]:
df = averageExpenses1[averageExpenses1['Category'] == 'Non-CAM Expenses']
corr = df.corr(numeric_only=True)  # skip the non-numeric 'Category' column

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for Ordinal Values')
plt.show()
[correlation heatmap omitted]
In [ ]:
df = averageExpenses1[averageExpenses1['Category'] == 'Operating Expenses']
corr = df.corr(numeric_only=True)  # skip the non-numeric 'Category' column

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for Ordinal Values')
plt.show()
[correlation heatmap omitted]
In [ ]:
df = averageExpenses1[averageExpenses1['Category'] == 'Real Estate Tax Expense']
corr = df.corr(numeric_only=True)  # skip the non-numeric 'Category' column

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for Ordinal Values')
plt.show()
[correlation heatmap omitted]

A gas-price dataset was found at the U.S. Energy Information Administration. It contains average petrol prices by state. It currently covers only 9 states, but they span every region of the US:

  • California
  • Colorado
  • Florida
  • Massachusetts
  • Minnesota
  • New York
  • Ohio
  • Texas
  • Washington

Reading the gas prices so they can be used to see how they might impact expenses. If the state we want to predict is not in this list, predictions will be based on the nearest state that is.

Link: https://www.eia.gov/dnav/pet/pet_pri_gnd_a_epm0_pte_dpgal_w.htm
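The nearest-state fallback can be sketched as a minimum-distance lookup over state centroids. The centroid coordinates below are illustrative approximations for a subset of the covered states, and `nearest_gas_state` is our own helper name:

```python
import math

# Approximate state centroids (lat, lon) for some covered states; illustrative values
GAS_STATE_CENTROIDS = {
    "California": (36.78, -119.42),
    "Texas": (31.97, -99.90),
    "Florida": (27.66, -81.52),
    "New York": (42.97, -75.61),
}

def haversine(a, b):
    # Great-circle distance in km between two (lat, lon) points
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_gas_state(lat, lon):
    # Fall back to the covered state whose centroid is closest to the mall
    return min(GAS_STATE_CENTROIDS,
               key=lambda s: haversine((lat, lon), GAS_STATE_CENTROIDS[s]))
```

A mall in a state without gas data would then borrow the price series of its nearest covered neighbour.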


After searching online for retail gasoline prices per state, a better dataset was found and downloaded from the following link: https://www.eia.gov/dnav/pet/PET_PRI_REFMG_A_EPM0_PWG_DPGAL_M.htm

In [ ]:
gasData = pd.read_excel('States_gas.xls', sheet_name='Data 1', header = 1)
In [ ]:
gasData.head()
Out[ ]:
[wide table output omitted: 61 columns keyed by EIA series codes (Sourcekey, EMA_EPM0_PWG_NUS_DPG, ..., EMA_EPM0_PWG_SWA_DPG); row 0 holds the descriptive series names ("U.S. Total Gasoline Wholesale/Resale Price ...", ..., "Washington Total Gasoline Wholesale/Resale Pri..."), and rows 1+ hold monthly prices starting 1983-01-15]
In [ ]:
unitExpensesPerMonthWithGas = unitExpensesPerMonth.copy()
# Adding the state of each Unit so we can combine the prices of gas
unitExpensesPerMonthWithGas = unitExpensesPerMonthWithGas.merge(unit_df[['Business Unit', 'State']], on='Business Unit', how='left')
In [ ]:
import pandas as pd

def get_substring_before_total(s):
    # Find the index of ' Total'
    index = s.find(' Total')

    # If 'Total' is not found, return the original string
    if index == -1:
        return s

    # Return the substring up to the index of 'Total'
    return s[:index]

# Strip the ' Total ...' suffix from the series names stored in the first row
for col in gasData.columns:
    gasData.at[0, col] = get_substring_before_total(gasData.at[0, col])

print(gasData)
               Sourcekey  ... EMA_EPM0_PWG_SWA_DPG
0                   Date  ...           Washington
1    1983-01-15 00:00:00  ...                0.836
2    1983-02-15 00:00:00  ...                0.765
3    1983-03-15 00:00:00  ...                0.719
4    1983-04-15 00:00:00  ...                0.739
..                   ...  ...                  ...
467  2021-11-15 00:00:00  ...                2.522
468  2021-12-15 00:00:00  ...                 2.46
469  2022-01-15 00:00:00  ...                2.625
470  2022-02-15 00:00:00  ...                2.657
471  2022-03-15 00:00:00  ...                3.275

[472 rows x 61 columns]
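The per-cell loop above can also be done in one vectorized pass with pandas string methods. A minimal sketch on a toy frame (the column names and values are stand-ins mimicking gasData's layout):

```python
import pandas as pd

# Toy frame mimicking gasData's layout: row 0 holds the verbose series names
toy = pd.DataFrame({
    'Sourcekey': ['Date', '1983-01-15'],
    'EMA_EPM0_PWG_SWA_DPG': ['Washington Total Gasoline Wholesale/Resale Price', '0.836'],
})

# Strip everything from ' Total' onward across the whole first row at once
toy.iloc[0] = toy.iloc[0].str.replace(r' Total.*$', '', regex=True)
print(toy.iloc[0].tolist())  # ['Date', 'Washington']
```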
In [ ]:
# Set the first row as the header
gasData.columns = gasData.iloc[0]

# Drop the first row
gasData = gasData.drop(gasData.index[0])

# Reset the index if needed
gasData = gasData.reset_index(drop=True)

# Convert to datetime
gasData['Date'] = pd.to_datetime(gasData['Date'])

# Format the date as 'YYYY-MM'
gasData['Date'] = gasData['Date'].dt.strftime('%Y-%m')
In [ ]:
gasData.head()
Out[ ]:
Date U.S. East Coast (PADD 1) New England (PADD 1A) Connecticut Maine Massachusetts New Hampshire Rhode Island Vermont Central Atlantic (PADD 1B) Delaware District of Columbia Maryland New Jersey New York Pennsylvania Lower Atlantic (PADD 1C) Florida Georgia North Carolina South Carolina Virginia West Virginia Midwest (PADD 2) Illinois Indiana Iowa Kansas Kentucky Michigan Minnesota Missouri Nebraska North Dakota Ohio Oklahoma South Dakota Tennessee Wisconsin Gulf Coast (PADD 3) Alabama Arkansas Louisiana Mississippi New Mexico Texas Rocky Mountain (PADD 4) Colorado Idaho Montana Utah Wyoming West Coast (PADD 5) Alaska Arizona California Hawaii Nevada Oregon Washington
0 1983-01 0.885 0.91 0.944 0.957 0.918 0.939 0.959 0.942 0.994 0.92 0.859 0.932 0.907 0.911 0.964 0.901 0.889 0.907 0.877 0.866 0.873 0.896 0.944 0.888 0.896 0.896 0.892 0.847 0.895 0.906 0.894 0.869 0.881 0.915 0.914 0.855 0.896 0.877 0.895 0.861 0.871 0.869 0.876 0.852 0.916 0.854 0.891 0.865 0.937 0.887 0.913 0.904 0.861 1.019 0.869 0.859 1.018 0.896 0.828 0.836
1 1983-02 0.854 0.881 0.91 0.917 0.888 0.912 0.921 0.902 0.945 0.882 0.831 0.903 0.872 0.871 0.925 0.858 0.871 0.884 0.864 0.858 0.861 0.869 0.897 0.864 0.873 0.876 0.865 0.829 0.873 0.882 0.857 0.847 0.853 0.857 0.887 0.836 0.861 0.862 0.865 0.845 0.859 0.849 0.86 0.846 0.881 0.838 0.826 0.828 0.846 0.816 0.811 0.841 0.807 0.992 0.814 0.811 0.926 0.846 0.753 0.765
2 1983-03 0.829 0.856 0.878 0.88 0.855 0.881 0.894 0.875 0.906 0.849 0.803 0.886 0.854 0.843 0.895 0.815 0.856 0.869 0.853 0.844 0.845 0.854 0.866 0.839 0.841 0.847 0.843 0.808 0.85 0.857 0.836 0.827 0.831 0.834 0.854 0.815 0.833 0.843 0.837 0.821 0.844 0.821 0.84 0.826 0.833 0.814 0.793 0.813 0.782 0.774 0.772 0.802 0.778 0.891 0.763 0.793 0.885 0.808 0.718 0.719
3 1983-04 0.865 0.884 0.901 0.899 0.901 0.901 0.914 0.901 0.916 0.873 0.845 0.914 0.88 0.872 0.915 0.829 0.892 0.896 0.889 0.89 0.887 0.894 0.879 0.886 0.887 0.887 0.9 0.865 0.885 0.901 0.891 0.879 0.884 0.877 0.895 0.868 0.881 0.884 0.89 0.854 0.875 0.872 0.865 0.865 0.853 0.847 0.828 0.857 0.794 0.816 0.793 0.833 0.816 0.915 0.787 0.836 0.878 0.831 0.755 0.739
4 1983-05 0.904 0.912 0.932 0.926 0.922 0.939 0.944 0.927 0.954 0.897 0.848 0.944 0.911 0.894 0.945 0.859 0.922 0.93 0.915 0.917 0.918 0.927 0.9 0.919 0.923 0.916 0.937 0.898 0.917 0.931 0.928 0.915 0.921 0.914 0.916 0.902 0.918 0.917 0.925 0.883 0.907 0.913 0.895 0.887 0.88 0.875 0.87 0.898 0.838 0.856 0.847 0.87 0.895 0.853 0.858 0.916 0.911 0.906 0.84 0.828
In [ ]:
unitExpensesPerMonthWithGas.head().T
Out[ ]:
0 1 2 3 4
Business Unit 15.0 15.0 15.0 15.0 18.0
Category Insurance Expense Non-CAM Expenses Operating Expenses Real Estate Tax Expense Insurance Expense
2000-01 0.0 0.0 0.0 0.0 -4175.0
2000-02 0.0 0.0 0.0 0.0 -4175.0
2000-03 0.0 0.0 0.0 0.0 -4175.0
... ... ... ... ... ...
2023-05 -12948.47 -482.55 -25715.77 -38166.58 -22037.25
2023-06 -18159.12 -1933.35 -77736.53 -38166.59 -31604.16
2023-07 -14685.33 -479.55 -96710.95 -45473.42 -25226.22
2023-08 -14685.33 -6986.82 -72839.18 -39210.42 -25226.22
State Pennsylvania Pennsylvania Pennsylvania Pennsylvania Florida

287 rows × 5 columns

In [ ]:
unitExpensesPerMonthWithGas.isnull().sum()
Out[ ]:
Business Unit    0
Category         0
2000-01          0
2000-02          0
2000-03          0
                ..
2023-05          0
2023-06          0
2023-07          0
2023-08          0
State            0
Length: 287, dtype: int64
In [ ]:
len(unitExpensesPerMonthWithGas['Business Unit'].unique())
Out[ ]:
431

Merging gas prices with each unit's monthly expenses to determine the effect of gasoline price on each category of spending

In [ ]:
# Build one 'Petrol' row per unit carrying that state's monthly gas price
petrol_rows = []
for unit in unitExpensesPerMonthWithGas['Business Unit'].unique():
    # Get the state for the current unit
    state = unitExpensesPerMonthWithGas[unitExpensesPerMonthWithGas['Business Unit'] == unit]['State'].iloc[0]

    if state not in gasData.columns:
        print(f"State {state} not found in gasData columns")
        continue

    # Initialize the new row with constant values
    new_row = {'Business Unit': unit, 'State': state, 'Category': 'Petrol'}

    # Flag to check whether at least one matching date was found
    match_found = False

    # Iterate through each year-month column
    for ym_col in unitExpensesPerMonthWithGas.columns[unitExpensesPerMonthWithGas.columns.str.match(r'\d{4}-\d{2}')]:
        # Find the matching row in gasData based on the date
        matching_row = gasData[gasData['Date'] == ym_col]

        if not matching_row.empty:
            # Retrieve the price for the state and add it to the new row
            new_row[ym_col] = matching_row[state].iloc[0]
            match_found = True

    # Keep the new row only if at least one date matched
    if match_found:
        petrol_rows.append(new_row)

# DataFrame.append was removed in pandas 2.0; concatenate all new rows at once instead
unitExpensesPerMonthWithGas = pd.concat(
    [unitExpensesPerMonthWithGas, pd.DataFrame(petrol_rows)], ignore_index=True)

print(unitExpensesPerMonthWithGas)
      Business Unit                 Category  ...   2023-08         State
0              15.0        Insurance Expense  ... -14685.33  Pennsylvania
1              15.0         Non-CAM Expenses  ...  -6986.82  Pennsylvania
2              15.0       Operating Expenses  ... -72839.18  Pennsylvania
3              15.0  Real Estate Tax Expense  ... -39210.42  Pennsylvania
4              18.0        Insurance Expense  ... -25226.22       Florida
...             ...                      ...  ...       ...           ...
2129      8001801.0                   Petrol  ...       NaN       Georgia
2130      8010201.0                   Petrol  ...       NaN   Connecticut
2131      8010202.0                   Petrol  ...       NaN   Connecticut
2132      8010203.0                   Petrol  ...       NaN   Connecticut
2133      8010204.0                   Petrol  ...       NaN   Connecticut

[2134 rows x 287 columns]
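The row-by-row construction above can be replaced by a single reshape-and-merge, which avoids repeated per-row appends. A sketch on toy frames (column names mirror the real ones; the values are made up):

```python
import pandas as pd

# Toy stand-ins: unit expenses carrying a State, and gas prices with one column per state
expenses = pd.DataFrame({
    'Business Unit': [15, 18],
    'Category': ['Operating Expenses', 'Operating Expenses'],
    '2000-01': [-100.0, -200.0],
    'State': ['Pennsylvania', 'Florida'],
})
gas = pd.DataFrame({'Date': ['2000-01'], 'Pennsylvania': [0.776], 'Florida': [0.801]})

# Transpose gas prices to one row per state, then attach them to each unit by State
gas_by_state = gas.set_index('Date').T.reset_index().rename(columns={'index': 'State'})
petrol = expenses[['Business Unit', 'State']].drop_duplicates().merge(gas_by_state, on='State', how='left')
petrol['Category'] = 'Petrol'

combined = pd.concat([expenses, petrol], ignore_index=True)
print(combined.shape)  # (4, 4): two expense rows plus two Petrol rows
```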
In [ ]:
sorted_df = unitExpensesPerMonthWithGas.sort_values(by='Business Unit')
In [ ]:
print(sorted_df.isnull().sum())
Business Unit      0
Category           0
2000-01            0
2000-02            0
2000-03            0
                ... 
2023-05          431
2023-06          431
2023-07          431
2023-08          431
State              0
Length: 287, dtype: int64

Filling null values with 0 so the data can be plotted

In [ ]:
sorted_df = sorted_df.fillna(0)

From the plot below, the price per gallon is so small relative to the expenses that it does not show on the plot. We will multiply the price per gallon by a constant so its effect is visible alongside the other values.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming sorted_df is your DataFrame and it's already sorted by 'Business Unit'

# Identify columns that match the 'YYYY-MM' pattern and sort them
date_columns = sorted(sorted_df.columns[sorted_df.columns.str.match(r'\d{4}-\d{2}')])

# Iterate over each unique value in the 'Business Unit' column
for unit in sorted_df['Business Unit'].unique():
    # Selecting the relevant rows for the unit
    unit_df = sorted_df[sorted_df['Business Unit'] == unit]
    plt.figure()  # Create a new figure for each Business Unit

    # Iterate over each unique category within the unit
    for category in unit_df['Category'].unique():
        # Selecting the relevant rows for the category
        category_df = unit_df[unit_df['Category'] == category]

        # Plotting the line for the category
        plt.plot(date_columns, category_df[date_columns].iloc[0], label=category)

    # Add legends, labels, and title for each Business Unit
    plt.legend()
    plt.xlabel('Date')
    plt.ylabel('Values')
    plt.title(f'Line Plot for Business Unit {unit} by Date')
    #plt.xticks(ticks=range(len(date_columns)), labels=date_columns, rotation=45)
    plt.xticks(rotation=90)
    # Rotate labels for better readability
    #plt.gcf().autofmt_xdate()
    # Set x-axis to show every third tick
    plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))
    # Show the plot for each Business Unit
    plt.show()
    break  # show only the first Business Unit here
[Plot: expense categories over time for the first Business Unit]
In [ ]:
df = sorted_df.copy()
In [ ]:
sorted_df.head().T
Out[ ]:
0 3 1703 1 2
Business Unit 15.0 15.0 15.0 15.0 15.0
Category Insurance Expense Real Estate Tax Expense Petrol Non-CAM Expenses Operating Expenses
2000-01 0.0 0.0 0.776 0.0 0.0
2000-02 0.0 0.0 0.845 0.0 0.0
2000-03 0.0 0.0 0.963 0.0 0.0
... ... ... ... ... ...
2023-05 -12948.47 -38166.58 0.0 -482.55 -25715.77
2023-06 -18159.12 -38166.59 0.0 -1933.35 -77736.53
2023-07 -14685.33 -45473.42 0.0 -479.55 -96710.95
2023-08 -14685.33 -39210.42 0.0 -6986.82 -72839.18
State Pennsylvania Pennsylvania Pennsylvania Pennsylvania Pennsylvania

287 rows × 5 columns

The multiplier is negative so that petrol trends point the same direction as the (negative) expense values, making trends easy to compare, and large enough that the petrol line does not shrink to the point where no observations can be made.

In [ ]:
sorted_df = df.copy()
# Getting date columns
date_columns = sorted(sorted_df.columns[sorted_df.columns.str.match(r'\d{4}-\d{2}')])
# Multiplying petrol values by a multiplier
multiplier = -10000

# Select rows where Category is 'Petrol' and multiply
sorted_df.loc[sorted_df['Category'] == 'Petrol', date_columns] *= multiplier
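An alternative to scaling petrol by a constant is a second y-axis, which keeps both series in their original units. A minimal sketch with made-up values (the series here are hypothetical stand-ins for one unit's data):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs outside the notebook
import matplotlib.pyplot as plt

dates = ['2000-01', '2000-02', '2000-03']
expense = [-2127.5, -2627.1, -2407.4]  # toy expense values in dollars
petrol = [0.776, 0.845, 0.963]         # toy prices in $/gallon

fig, ax1 = plt.subplots()
ax1.plot(dates, expense, color='red', label='Operating Expenses')
ax1.set_ylabel('Expenses ($)')

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(dates, petrol, color='black', label='Petrol')
ax2.set_ylabel('Gasoline price ($/gallon)')

fig.savefig('unit_expenses_vs_petrol.png')
```

The trade-off is that twin-axis plots make direct magnitude comparison harder, but trend comparison no longer depends on the choice of multiplier.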

Only 20 units are plotted, as plotting all of them breaks the notebook due to its size.

Looking at the plots and the amplified price of gasoline, we can deduce the following:

  • Gasoline price has a moderate effect on Insurance Expense.
  • Gasoline price has no effect on Real Estate Tax Expense.
  • Gasoline price has no effect on Non-CAM Expenses.
  • Gasoline price has a strong effect on Operating Expenses.
  • Operating expenses rise with the gasoline price but do not fall with it. This can be attributed to businesses raising prices to absorb an increasing petrol price but rarely lowering them afterwards.
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Create a dictionary that maps categories to colors
category_colors = {
    'Insurance Expense': 'blue',
    'Real Estate Tax Expense': 'green',
    'Petrol': 'black',
    'Non-CAM Expenses': 'orange',
    'Operating Expenses': 'red',
    # Add more categories and their corresponding colors as needed
}

# Identify columns that match the 'YYYY-MM' pattern and sort them
date_columns = sorted(sorted_df.columns[sorted_df.columns.str.match(r'\d{4}-\d{2}')])
count = 0
# Iterate over each unique value in the 'Business Unit' column
for unit in sorted_df['Business Unit'].unique():

    # plotting every unit breaks the notebook, so stop after 20
    count = count + 1
    if count > 20:
        break
    # Selecting the relevant rows for the unit
    unit_df = sorted_df[sorted_df['Business Unit'] == unit]
    plt.figure()  # Create a new figure for each Business Unit

    # Iterate over each unique category within the unit
    for category in unit_df['Category'].unique():
        # Selecting the relevant rows for the category
        category_df = unit_df[unit_df['Category'] == category]

        # Use the dictionary to retrieve the color for the category
        color = category_colors.get(category, 'black')

        # Plotting the line for the category with the specified color
        plt.plot(date_columns, category_df[date_columns].iloc[0], label=category, color=color)

    # Add legends, labels, and title for each Business Unit
    plt.legend()
    plt.xlabel('Date')
    plt.ylabel('Values')
    plt.title(f'Line Plot for Business Unit {unit} by Date')
    #plt.xticks(ticks=range(len(date_columns)), labels=date_columns, rotation=45)
    plt.xticks(rotation=90)
    # Rotate labels for better readability
    #plt.gcf().autofmt_xdate()
    # Set x-axis to show every third tick
    plt.gca().xaxis.set_major_locator(ticker.MultipleLocator(6))
    # Show the plot for each Business Unit
    plt.show()
[Plots: expense categories (color-coded) over time for the first 20 Business Units]

We will calculate seasonality and trend for each of the categories

In [ ]:
trend_df = sorted_df.drop(columns=['State', 'Business Unit'])
In [ ]:
date_columns = sorted(df.columns[df.columns.str.match(r'\d{4}-\d{2}')])
# Getting the mean of the expenses across all units per month, by category
trend_df = trend_df.groupby('Category')[date_columns].mean()
trend_df = trend_df.T
In [ ]:
trend_df.head()
Out[ ]:
Category Insurance Expense Non-CAM Expenses Operating Expenses Petrol Real Estate Tax Expense
2000-01 -251.157143 -105.191740 -2127.562668 -7999.234339 -3014.519169
2000-02 -251.157143 -230.288561 -2627.131647 -8887.285383 -3340.863064
2000-03 -251.157143 -187.583782 -2407.397541 -10119.628770 -3164.363040
2000-04 -251.157143 -206.295824 -2429.281160 -9181.322506 -2831.686532
2000-05 -251.157143 -213.932807 -2527.201392 -9897.795824 -3126.370784
In [ ]:
# packages for time series analysis
from statsmodels.graphics import tsaplots
import statsmodels.api as sm

def plot_trends(trend):
    trend_df.index = pd.to_datetime(trend_df.index)
    # Perform time series decomposition
    decomposition = sm.tsa.seasonal_decompose(trend_df[trend])

    # Plot decomposition
    fig = decomposition.plot()

    # Plot settings
    %config InlineBackend.figure_format='retina'
    plt.style.use("ggplot")
    plt.rcParams["figure.figsize"] = 12, 9  # Figure size (width,height)
    plt.xlabel("Date", fontsize=12)  # Text and size of xlabel
    plt.suptitle(
        "Monthly Expense Time Series Decomposition",  # Text of title
        y=1.05,  # Position of title
        size=15,  # Size of title
    )
    plt.show()
In [ ]:
import matplotlib.dates as mdates
def plot_trends(trend):
    # Ensure the index is a datetime object
    trend_df.index = pd.to_datetime(trend_df.index)

    # Perform time series decomposition
    decomposition = sm.tsa.seasonal_decompose(trend_df[trend], model='additive')

    # Plot decomposition
    fig = decomposition.plot()

    # Improve plot aesthetics and settings
    plt.style.use("ggplot")
    plt.rcParams["figure.figsize"] = (12, 9)  # Figure size (width,height)

    # Adjusting x-axis to display labels every 6 months
    plt.gca().xaxis.set_major_locator(mdates.MonthLocator(interval=6))  # Set major ticks to every 6 months
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))  # Format ticks to show month and year

    # Rotate the tick labels for x-axis
    for label in plt.gca().get_xticklabels():
        label.set_rotation(90)

    plt.xlabel("Date", fontsize=12)  # Text and size of xlabel
    plt.suptitle(
        "Monthly Expense Time Series Decomposition",  # Text of title
        y=1.05,  # Position of title
        size=15,  # Size of title
    )
    plt.show()

Based on the trend and seasonality analysis we infer the following:

  • All expense categories show an increasing trend in the value of expenses.
  • All expense categories show some seasonality, with the pattern repeating around December and June: values usually dip in June and peak in December.
  • Insurance Expense and Real Estate Tax have a low residual, while Non-CAM Expenses and Operating Expenses have high variance in the residual.
  • Petrol shows seasonality around the same time as the expenses, but no obvious trend in pricing. (This is the amplified petrol price.)
In [ ]:
arr = ['Insurance Expense','Non-CAM Expenses','Operating Expenses','Real Estate Tax Expense','Petrol']

for val in arr:
    plot_trends(val)
[Plots: seasonal decomposition (trend, seasonal, residual) for each of the five categories]
In [ ]:
corr = trend_df.corr()

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap for numerical Values')
plt.show()
[Plot: correlation heatmap of the five categories]
In [ ]:
trend_df.head()
Out[ ]:
Category Insurance Expense Non-CAM Expenses Operating Expenses Petrol Real Estate Tax Expense
2000-01-01 -251.157143 -105.191740 -2127.562668 -7999.234339 -3014.519169
2000-02-01 -251.157143 -230.288561 -2627.131647 -8887.285383 -3340.863064
2000-03-01 -251.157143 -187.583782 -2407.397541 -10119.628770 -3164.363040
2000-04-01 -251.157143 -206.295824 -2429.281160 -9181.322506 -2831.686532
2000-05-01 -251.157143 -213.932807 -2527.201392 -9897.795824 -3126.370784

Based on the findings so far, a machine learning model can be created to predict Regency Square's expenses, for the following reasons:

  • Insurance Expense:
    • Based on the plots and residual, a linear regression model might be sufficient for predicting the expenses. However, the sudden spikes in the expenses cannot be explained with our current knowledge of the business data.
  • Non-CAM Expenses:
    • For the same reasons as Insurance Expense, a linear regression model might be sufficient for predicting the expenses.
  • Operating Expenses:
    • These are by far the hardest to predict due to their high variance. Despite the correlation heatmap, we believe an increase in petrol price pushes these expenses up. The low measured correlation likely comes from expenses not falling when petrol prices fall, and from the lag with which expenses adapt to petrol prices.
  • Real Estate Tax:
    • We cannot identify the underlying patterns in the data or the reason for the frequent spikes in this expense; we attribute this to our limited knowledge of the business context.

Based on the correlation heatmap, all expense categories are strongly correlated with each other. We therefore conclude that a machine learning model should use the other categories' expenses as inputs.
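The claim that operating expenses react to petrol prices with a lag can be probed with lagged correlations. A sketch on synthetic monthly series, built so that expenses trail petrol by two months (with the real data, trend_df['Operating Expenses'] and trend_df['Petrol'] would take their place):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic series: expenses follow petrol with a 2-month delay plus noise
petrol = pd.Series(np.cumsum(rng.normal(0, 1, 120)))
expenses = petrol.shift(2).fillna(0) + rng.normal(0, 0.5, 120)

# Correlate expenses against petrol shifted back by k months
corrs = {k: expenses.corr(petrol.shift(k)) for k in range(7)}
best_lag = max(corrs, key=corrs.get)
print(best_lag)  # expected to sit near 2 for a delayed response
```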

WIP 3

We will work on creating a model to predict the expenses based on the extracted attributes

In [ ]:
sorted_df.head().T
Out[ ]:
0 3 1703 1 2
Business Unit 15.0 15.0 15.0 15.0 15.0
Category Insurance Expense Real Estate Tax Expense Petrol Non-CAM Expenses Operating Expenses
2000-01 0.0 0.0 -7760.0 0.0 0.0
2000-02 0.0 0.0 -8450.0 0.0 0.0
2000-03 0.0 0.0 -9630.0 0.0 0.0
... ... ... ... ... ...
2023-05 -12948.47 -38166.58 -0.0 -482.55 -25715.77
2023-06 -18159.12 -38166.59 -0.0 -1933.35 -77736.53
2023-07 -14685.33 -45473.42 -0.0 -479.55 -96710.95
2023-08 -14685.33 -39210.42 -0.0 -6986.82 -72839.18
State Pennsylvania Pennsylvania Pennsylvania Pennsylvania Pennsylvania

287 rows × 5 columns

In [ ]:
unit_df.head()
Out[ ]:
Business Unit Division Region State Zip Code Lat Long Year Constructed Year Acquired Expenses Near Water
0 15.0 Northeast PA-DE Pennsylvania 19087 40.0612 -75.3999 1960.0 2004.0 -1 0
1 18.0 Southeast Tampa Florida 33511 27.9056 -82.2881 1986.0 1993.0 -1 1
2 20.0 Southeast Jacksonville Florida 32606 29.6954 -82.4023 1986.0 1994.0 -1 1
3 43.0 Southeast Jacksonville Florida 32605 29.6785 -82.3679 1974.0 1993.0 -1 1
4 51.0 Southeast Tampa Florida 34104 26.1529 -81.7417 1992.0 1994.0 -1 1
In [ ]:
# Merging dataframes to combine the attributes into a single dataframe
merged_df = sorted_df.merge(unit_df[['Business Unit', 'State', 'Year Constructed', 'Near Water']],
                            on='Business Unit',
                            how='left')
In [ ]:
merged_df.head().T
Out[ ]:
0 1 2 3 4
Business Unit 15.0 15.0 15.0 15.0 15.0
Category Insurance Expense Real Estate Tax Expense Petrol Non-CAM Expenses Operating Expenses
2000-01 0.0 0.0 -7760.0 0.0 0.0
2000-02 0.0 0.0 -8450.0 0.0 0.0
2000-03 0.0 0.0 -9630.0 0.0 0.0
... ... ... ... ... ...
2023-08 -14685.33 -39210.42 -0.0 -6986.82 -72839.18
State_x Pennsylvania Pennsylvania Pennsylvania Pennsylvania Pennsylvania
State_y Pennsylvania Pennsylvania Pennsylvania Pennsylvania Pennsylvania
Year Constructed 1960.0 1960.0 1960.0 1960.0 1960.0
Near Water 0 0 0 0 0

290 rows × 5 columns

In [ ]:
# Moving the last 3 columns to the beginning
columns = merged_df.columns.tolist()

# Reorder the columns to move the last three to the beginning
new_columns = columns[-3:] + columns[:-3]

# Create a new DataFrame with the reordered columns
merged_df = merged_df[new_columns]

# Display the updated DataFrame
merged_df.head().T
Out[ ]:
0 1 2 3 4
State_y Pennsylvania Pennsylvania Pennsylvania Pennsylvania Pennsylvania
Year Constructed 1960.0 1960.0 1960.0 1960.0 1960.0
Near Water 0 0 0 0 0
Business Unit 15.0 15.0 15.0 15.0 15.0
Category Insurance Expense Real Estate Tax Expense Petrol Non-CAM Expenses Operating Expenses
... ... ... ... ... ...
2023-05 -12948.47 -38166.58 -0.0 -482.55 -25715.77
2023-06 -18159.12 -38166.59 -0.0 -1933.35 -77736.53
2023-07 -14685.33 -45473.42 -0.0 -479.55 -96710.95
2023-08 -14685.33 -39210.42 -0.0 -6986.82 -72839.18
State_x Pennsylvania Pennsylvania Pennsylvania Pennsylvania Pennsylvania

290 rows × 5 columns

In [ ]:
# Removing duplicate column
merged_df = merged_df.drop(columns=['State_x'])
In [ ]:
merged_df.head().T
Out[ ]:
0 1 2 3 4
State_y Pennsylvania Pennsylvania Pennsylvania Pennsylvania Pennsylvania
Year Constructed 1960.0 1960.0 1960.0 1960.0 1960.0
Near Water 0 0 0 0 0
Business Unit 15.0 15.0 15.0 15.0 15.0
Category Insurance Expense Real Estate Tax Expense Petrol Non-CAM Expenses Operating Expenses
... ... ... ... ... ...
2023-04 -12948.47 -38166.58 -0.0 -720.05 -73458.0
2023-05 -12948.47 -38166.58 -0.0 -482.55 -25715.77
2023-06 -18159.12 -38166.59 -0.0 -1933.35 -77736.53
2023-07 -14685.33 -45473.42 -0.0 -479.55 -96710.95
2023-08 -14685.33 -39210.42 -0.0 -6986.82 -72839.18

289 rows × 5 columns

In [ ]:
# Creating a date column and getting rid of all the year-month columns
df_melted = pd.melt(merged_df, id_vars=['State_y', 'Year Constructed', 'Near Water', 'Business Unit', 'Category'],
                     var_name='Date', value_name='Value')
In [ ]:
df_melted.head()
Out[ ]:
State_y Year Constructed Near Water Business Unit Category Date Value
0 Pennsylvania 1960.0 0 15.0 Insurance Expense 2000-01 0.0
1 Pennsylvania 1960.0 0 15.0 Real Estate Tax Expense 2000-01 0.0
2 Pennsylvania 1960.0 0 15.0 Petrol 2000-01 -7760.0
3 Pennsylvania 1960.0 0 15.0 Non-CAM Expenses 2000-01 0.0
4 Pennsylvania 1960.0 0 15.0 Operating Expenses 2000-01 0.0
In [ ]:
df = df_melted.copy()
In [ ]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
In [ ]:
# Transforming categorical attributes to numerical
df['State_y'] = encoder.fit_transform(df['State_y'])


df['Category'] = encoder.fit_transform(df['Category'])
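LabelEncoder maps states to arbitrary integers, which implies an ordering that does not exist. Tree models tolerate this, but for linear models one-hot encoding is usually safer. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'State_y': ['Pennsylvania', 'Florida', 'Florida'],
                    'Value': [1.0, 2.0, 3.0]})

# One indicator column per state; no spurious Florida < Pennsylvania ordering
encoded = pd.get_dummies(toy, columns=['State_y'], prefix='State')
print(sorted(encoded.columns))  # ['State_Florida', 'State_Pennsylvania', 'Value']
```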
In [ ]:
# Convert the date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Optionally, extract features from the date (like year, month, day, etc.)
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
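Month is cyclical: December (12) and January (1) are adjacent, but as raw integers they are maximally far apart. Sine/cosine features preserve that adjacency; a sketch (these extra columns are an optional addition, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

months = pd.Series([1, 6, 12])
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

# On the unit circle, month 12 (cos = 1.0) sits next to month 1 (cos ~ 0.866)
print(month_cos.round(3).tolist())  # [0.866, -1.0, 1.0]
```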
In [ ]:
X = df[['year', 'month', 'State_y', 'Year Constructed', 'Near Water','Business Unit','Category']]
y = df['Value']
In [ ]:
# TimeSeriesSplit takes temporal order into account when making the splits, which suits our data
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LinearRegression

We used MSE (as negative mean squared error) to score the models. It is hard to judge in absolute terms how good a model is from these scores alone, but the lower the magnitude, the better.

In [ ]:
tscv = TimeSeriesSplit(n_splits=10)

# Create your machine learning model (e.g., Linear Regression)
model = LinearRegression()

# Perform cross-validation
scores = cross_val_score(model, X=X, y=y, cv=tscv, scoring='neg_mean_squared_error')

formatted_scores = [f'{score:.2f}' for score in scores]
# Print the cross-validation scores
print("Cross-validation scores:", formatted_scores)
Cross-validation scores: ['-48087238.06', '-97633884.03', '-172249454.47', '-172575673.78', '-209691621.38', '-223197640.65', '-238692090.66', '-675365995.26', '-689415896.40', '-836801141.05']
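The neg-MSE scores above are easier to read as RMSE, which is in the target's own units (dollars here). A quick conversion using two of the fold scores from above:

```python
import numpy as np

scores = np.array([-48087238.06, -97633884.03])  # neg-MSE from cross_val_score
rmse = np.sqrt(-scores)  # root mean squared error, back in dollars
print(rmse.round(0))
```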
In [ ]:
# Plotting the values of the models to get more visual information about how good is our model.
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Assume X and y are defined

# Create a TimeSeriesSplit iterator
def train_and_plot_model(model, X, y):
    tscv = TimeSeriesSplit(n_splits=5)

    # Storing actual and predicted values for plotting
    actual_values = []
    predicted_values = []

    for train_index, test_index in tscv.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        actual_values.extend(y_test)
        predicted_values.extend(predictions)

    # Plotting the actual vs predicted values
    plt.figure(figsize=(10, 6))
    plt.plot(actual_values, label='Actual Values')
    plt.plot(predicted_values, label='Predicted Values', alpha=0.7)
    plt.title('Actual vs Predicted Values')
    plt.xlabel('Time Step')
    plt.ylabel('Target Variable')
    plt.legend()
    plt.show()
In [ ]:
train_and_plot_model(LinearRegression(),X,y)
[Plot: actual vs. predicted values, Linear Regression]
In [ ]:
train_and_plot_model(RandomForestRegressor(),X,y)
[Plot: actual vs. predicted values, Random Forest]
In [ ]:
train_and_plot_model(GradientBoostingRegressor(),X,y)
[Plot: actual vs. predicted values, Gradient Boosting]

In the plots, Random Forest shows the best estimation in predicting the expenses. Given further optimization through hyperparameter tuning, or more accurate data, this model is expected to become a very good estimator of the expenses.
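The hyperparameter tuning suggested above can combine RandomizedSearchCV with TimeSeriesSplit so that tuning also respects temporal order. A sketch on synthetic data (the grid values are illustrative, not tuned for this dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for X and y
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = 3 * X_toy[:, 0] + rng.normal(size=200)

param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_dist,
    n_iter=5,
    cv=TimeSeriesSplit(n_splits=3),  # each fold trains on earlier data, tests on later
    scoring='neg_mean_squared_error',
    random_state=0,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```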